Multi-modal electronic document classification

ABSTRACT

A method comprising operating at least one hardware processor for: receiving, as input, a plurality of electronic documents, training a machine learning classifier based, at least in part, on a training set comprising: (i) labels associated with the electronic documents, (ii) raw text from each of said plurality of electronic documents, and (iii) a rasterized version of each of said plurality of electronic documents, and applying said machine learning classifier to classify one or more new electronic documents.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/037,194, filed Jul. 17, 2018, and entitled “Multi-Modal Electronic Document Classification”, which claims the benefit of priority to U.S. Provisional Patent Application No. 62/698,168, filed Jul. 15, 2018, and entitled “Multi-Modal Electronic Document Classification”. The contents of the above applications are all incorporated by reference as if fully set forth herein in their entirety.

BACKGROUND

The invention relates to the field of machine learning.

Multi-modal representation of data may be beneficial in several machine learning tasks, such as image captioning, visual question answering, multi-lingual data retrieval, and electronic document classification. This is because, in many instances, an amalgamation of multiple views of an input sample is likely to capture more meaningful information than a representation that accounts for only a single modality. For example, in the task of scene recognition in a video, video data generally comprises video frames (images) along with audio. Images and audio thus comprise two different representations of the same input sample, each with different representative features. By combining these two modalities into a common subspace, classification of abstract scenes from the video can become more accurate.

The foregoing examples of the related art and limitations related therewith are intended to be illustrative and not exclusive. Other limitations of the related art will become apparent to those of skill in the art upon a reading of the specification and a study of the figures.

SUMMARY

The following embodiments and aspects thereof are described and illustrated in conjunction with systems, tools and methods which are meant to be exemplary and illustrative, not limiting in scope.

There is provided, in accordance with an embodiment, a method comprising operating at least one hardware processor for: receiving, as input, a plurality of electronic documents, training a machine learning classifier based, at least in part, on a training set comprising: (i) labels associated with the electronic documents, (ii) raw text from each of said plurality of electronic documents, and (iii) a rasterized version of each of said plurality of electronic documents, and applying said machine learning classifier to classify one or more new electronic documents.

There is also provided, in accordance with an embodiment, a system comprising at least one hardware processor; and a non-transitory computer-readable storage medium having stored thereon program instructions, the program instructions executable by the at least one hardware processor to: receive, as input, a plurality of electronic documents, train a machine learning classifier based, at least in part, on a training set comprising: (i) labels associated with the electronic documents, (ii) raw text from each of said plurality of electronic documents, and (iii) a rasterized version of each of said plurality of electronic documents, and apply said machine learning classifier to classify one or more new electronic documents.

There is further provided, in accordance with an embodiment, a computer program product comprising a non-transitory computer-readable storage medium having program instructions embodied therewith, the program instructions executable by at least one hardware processor to: receive, as input, a plurality of electronic documents, train a machine learning classifier based, at least in part, on a training set comprising: (i) labels associated with the electronic documents, (ii) raw text from each of said plurality of electronic documents, and (iii) a rasterized version of each of said plurality of electronic documents, and apply said machine learning classifier to classify one or more new electronic documents.

In some embodiments, said labels denote document categories.

In some embodiments, the method further comprises, and in the case of the system and computer program product, the instructions further comprise, with respect to each of said plurality of sample electronic documents, applying one or more neural networks (i) to said extracted text, to generate a data representation of said extracted text as a fixed-length vector; and (ii) to said image, to generate a data representation of said image which corresponds to a visual layout of said sample electronic document.

In some embodiments, said one or more neural networks are selected from the group consisting of: Neural Bag-of-Words (NBOW), recurrent neural network (RNN), Recursive Neural Tensor Network (RNTN), Convolutional neural network (CNN), Dynamic Convolutional Neural Network (DCNN), Long short-term memory network (LSTM), and recursive neural network (RecNN).

In some embodiments, said one or more neural networks comprise one or more hidden layers.

In some embodiments, the method further comprises, and in the case of the system and computer program product, the instructions further comprise, calculating a correlation between (i) said data representation of said extracted text, and (ii) said data representation of said image.

In some embodiments, the method further comprises, and in the case of the system and computer program product, the instructions further comprise, generating, with respect to each of said plurality of electronic documents, a combined data representation based, at least in part, on (i) said data representation of said extracted text, (ii) said data representation of said image, and (iii) said correlation.

In some embodiments, said generating is based, at least in part, on a cost function which: (i) minimizes an error of reconstructing said extracted text from said data representation of said extracted text, and said image from said data representation of said image; (ii) minimizes an error of cross-reconstructing said extracted text from said data representation of said image and said image from said data representation of said extracted text; and (iii) maximizes said correlation between said data representation of said extracted text, and said data representation of said image.

In some embodiments, said combined data representation is provided as a training input to said machine learning classifier.

In addition to the exemplary aspects and embodiments described above, further aspects and embodiments will become apparent by reference to the figures and by study of the following detailed description.

BRIEF DESCRIPTION OF THE FIGURES

Exemplary embodiments are illustrated in referenced figures. Dimensions of components and features shown in the figures are generally chosen for convenience and clarity of presentation and are not necessarily shown to scale. The figures are listed below.

FIGS. 1A-1B illustrate a raster image of a document and raw text extracted therefrom;

FIGS. 2A-2C schematically illustrate a process for the creation of a fusion representation of an electronic document, according to an embodiment;

FIGS. 3A-3D illustrate various multi-modal classification models, according to an embodiment; and

FIG. 4 is a flowchart of the functional steps in a method for training a multi-modal classifier, according to an embodiment.

DETAILED DESCRIPTION

Disclosed herein are a method, system, and computer program product for training a machine learning classifier to classify electronic documents based on a multi-modal training model. In some embodiments, the training model is based on at least two modalities: the raw textual content of the electronic documents, and the visual structure of the electronic documents (which may, in turn, include at least one of text styling and non-textual graphics such as illustrations and photos).

As used herein, the term “electronic document” refers broadly to any document containing mainly text and stored in a computer-readable format. Electronic document formats may include, among others, Portable Document Format (PDF), Device Independent file format (DVI), PostScript, word processing file formats, such as docx, doc, and Rich Text Format (RTF), and/or XML Paper Specification (XPS).

Document classification is a known task in the field of information retrieval and machine learning, and it plays an important role in a variety of applications. Embodiments of the present invention may contribute to enterprise content management, by ensuring that enterprise documents are stored in a logical, organized way that makes it easy to retrieve the information quickly and efficiently. An automatic document classification tool according to some embodiments can thus realize a significant reduction in manual entry costs, and improve the speed and turnaround time for document processing. Such tools may be especially useful for publishers, financial institutions, insurance companies, and/or any industry that deals with large amounts of content.

In some embodiments, the present invention may be configured for automatic document classification based, at least in part, on content-based assignment of one or more predefined categories (classes) to documents. By classifying the content of a document, it may be assigned one or more predefined classes or categories, thus making it easier to manage and sort.

Typically, multi-class machine learning classifiers are trained on a training set of documents, where each document belongs to one of a certain number of distinct classes (e.g., invoices, scientific papers, resumes, letters). The training set may be labeled with the correct classes (e.g., for supervised learning), or may not be labeled (e.g., in the case of unsupervised learning). Following a training stage, the classifier may be able to predict the most probable class for each document in a test set of documents. Although document classification may be based on textual content alone, for some types of documents, the task of classification can be significantly enhanced by also generating features from the visual structure of the document. This is based on the idea that documents in the same category often also share similar layout and structure features.

Accordingly, in some embodiments, the present invention provides for training a multi-modal machine learning classifier on a training set comprising a plurality of electronic documents, each represented by (i) its textual content (i.e., raw text), and (ii) a raster image thereof. In some cases, an electronic document in the training set may only include one representation of the two. In such cases, a preprocessing stage may be necessary to generate the second representation from the first. For example, a raster image of a document may be created using a suitable application, or raw text may be extracted from a raster image using, e.g., optical character recognition (OCR). FIG. 1, in panel A, illustrates an exemplary raster image of an electronic document. In panel B, raw text has been extracted from the raster image using, e.g., OCR. In some cases, the preprocessing stage may further include scanning hard copies of documents to generate a raster image, from which raw text may then be extracted using OCR.

In some embodiments, following a multi-modal training stage, a trained classifier of the present invention may be configured for classifying electronic documents based on a multi-modal input comprising both representations of the documents. In other embodiments, the trained classifier may be configured for classifying electronic documents based on only a single modality input (e.g., textual content or raster image alone), with improved classification accuracy as compared to a classifier which has been trained solely based on a single modality.

In some embodiments, the present invention may employ one or more types of neural networks to further generate data representations of the multi-modal inputs. For example, raw input text from an electronic document may be processed so as to generate a data representation of the text as a fixed-length vector. Similarly, images of the electronic document (e.g., thumbnails or raster images) may be processed to extract image features.

In some embodiments, the neural network models employed by the present invention to generate textual data representations may be selected from the group consisting of Neural Bag-of-Words (NBOW); recurrent neural network (RNN); Recursive Neural Tensor Network (RNTN); Dynamic Convolutional Neural Network (DCNN); Long short-term memory network (LSTM); and recursive neural network (RecNN). See, e.g., Pengfei Liu et al., “Recurrent Neural Network for Text Classification with Multi-Task Learning”, Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16). A convolutional neural network (CNN) may be used, e.g., to extract image features which represent the physical visual structure of a document.

In some embodiments, the present invention may further be configured for employing a common representation learning (CRL) framework, for learning a common representation of the two views of data (i.e., textual and visual). CRL is associated with multi-view data that can be represented in multiple forms. The learned common representation can then be used to train a model to reconstruct all the views of the data from each input. CRL of multi-view data can be categorized into two main categories: canonical-based approaches and autoencoder-based methods. Canonical Correlation Analysis (CCA)-based approaches comprise learning a joint representation by maximizing correlation of the views when projected to the common subspace. Autoencoder (AE) methods learn a common representation by minimizing the error of reconstructing the two views. AE-based approaches use deep neural networks that try to optimize two objective functions. The first objective is to find a compressed hidden representation of data in a low-dimensional vector space. The other objective is to reconstruct the original data from the compressed low-dimensional subspace. Multi-modal autoencoders (MAE) are two-channeled models which specifically perform two types of reconstructions. The first is the self-reconstruction of a view from itself, and the other is the cross-reconstruction where each view is reconstructed from the other. These reconstruction objectives provide MAE the ability to adapt towards transfer learning tasks as well. In the context of CRL, each of these approaches has its own advantages and disadvantages. For example, though CCA-based approaches outperform AE-based approaches for the task of transfer learning, they are not as scalable as the latter.

Accordingly, in some embodiments, the present invention may combine elements of both CCA and AE. In some embodiments, given textual and visual structure data representations of an electronic document, the present invention may be configured for training a neural network to maximize reconstruction and cross-reconstruction abilities of both input sources, based, at least in part, on an autoencoder framework. The present invention may then provide for extracting a correlation between the two representations using, e.g., a DCCA (deep CCA) paradigm, in order to create a fusion representation which may optimally summarize both the textual content and the visual structure data contained in the input electronic document. In some embodiments, the fusion representation of the present model may further be applied as a direct input in an electronic document classification model.

In some embodiments, a Correlational Neural Network (CorrNet) may be trained on maximizing a reconstruction ability with respect to each representation, as well as maximizing the correlation between the two representations, thus producing a fusion representation of the document, which may be then used for enhanced classification purposes. In addition, in some embodiments, once the CorrNet classification model has been trained on both representations, an enhanced classification prediction may then be achieved using only a single input source. In other words, the multi-modal trained classification model may allow for textual content to be sufficient on its own as an input, to generate improved classification accuracy, as compared to a model which has been trained solely based on textual input.

FIG. 2 schematically illustrates a process for the creation of a fusion representation of an electronic document, according to an embodiment. In some embodiments, given an input sample

z=(x,y),

where x and y are the raw text and image inputs, respectively, of a sample electronic document, a correlational neural model of the present invention may be configured, at a step A, to create a compressed multi-modal representation as follows:

h(z)=f(Wx+Vy+b),

where W, V are projection matrices, b is a bias vector, and f is an activation function.

An output layer of the model may then aim to generate z′, which is a reconstruction of z, from the hidden representations of x, y as follows:

z′=g([W′h(z),V′h(z)]+b′),

where W′, V′ are reconstruction matrices, b′ is a bias vector, and g is an activation function.
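The encoding and reconstruction steps above can be sketched in a few lines of plain Python. This is a minimal illustration only, not the implementation of any embodiment: the sigmoid activation for f and g, the helper names (`matvec`, `add`, `encode`, `decode`), and the idea of passing a zero vector for a missing view are all assumptions made for the example.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(parts) for parts in zip(*vectors)]

def sigmoid(v):
    """Assumed activation; the text does not fix f or g."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def encode(x, y, W, V, b, f=sigmoid):
    """h(z) = f(Wx + Vy + b): joint hidden representation of both views."""
    return f(add(matvec(W, x), matvec(V, y), b))

def decode(h, W_rec, V_rec, b_rec, g=sigmoid):
    """z' = g([W'h, V'h] + b'), returned as the pair (x', y')."""
    x_part = matvec(W_rec, h)
    y_part = matvec(V_rec, h)
    out = g(add(x_part + y_part, b_rec))  # list concatenation, then bias
    return out[:len(x_part)], out[len(x_part):]

# Toy example: 3-dimensional text view, 2-dimensional image view, 2 hidden units.
W = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]]
V = [[0.5, -0.5], [0.3, 0.1]]
h = encode([1.0, 0.0, 1.0], [0.5, 0.5], W, V, [0.0, 0.0])
x_rec, y_rec = decode(h, [[0.1, 0.2]] * 3, [[0.3, 0.4]] * 2, [0.0] * 5)
```

Under this sketch, a single-view representation such as h(x) could be obtained by calling `encode` with an all-zero vector in place of the missing view; that convention is an assumption of the example rather than something stated in the text.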

In some embodiments, the present invention may thus be configured for training a correlational neural network to classify samples comprising both textual (x_i) and visual (y_i) content, based on a cost function which, with respect to each sample in the given training set (x_i, y_i), attempts to:

-   minimize the self-reconstruction error (L₁), which is equal to the errors in reconstructing x_i from x_i and y_i from y_i;
-   minimize the cross-reconstruction errors (L₂, L₃), which are equal to the errors in reconstructing x_i from y_i and y_i from x_i; and
-   maximize the correlation (L₄) between the representations of both views.

In some embodiments, using a regularization hyper-parameter λ for scaling the offset of the correlation, the cost function employed by the present model may be defined as:

$J_{z}(\theta) = \sum_{i=1}^{N} \Big( L\big(z_i, g(h(z_i))\big) + L\big(z_i, g(h(x_i))\big) + L\big(z_i, g(h(y_i))\big) \Big) - \lambda\, \mathrm{corr}\big(h(X), h(Y)\big),$

where (i) L(z_i, g(h(z_i))) is the self-reconstruction error L₁; (ii) L(z_i, g(h(x_i))) and L(z_i, g(h(y_i))) are the cross-reconstruction errors L₂, L₃; and (iii) λ corr(h(X), h(Y)) is the correlation L₄. In some embodiments, the correlation L₄ between the two representations of the views may be computed as:

$\mathrm{corr}\big(h(X), h(Y)\big) = \frac{\sum_{i=1}^{N} \big(h(x_i) - \overline{h(X)}\big)\big(h(y_i) - \overline{h(Y)}\big)}{\sqrt{\sum_{i=1}^{N} \big(h(x_i) - \overline{h(X)}\big)^{2} \sum_{i=1}^{N} \big(h(y_i) - \overline{h(Y)}\big)^{2}}}$

The above process can be further repeated to obtain a deeper neural architecture, with hidden layers added to both the encoding and the decoding phases, e.g., as shown at B and C in FIG. 2. Once the correlational neural model has been trained on the training set, the correlational representation achieved in the middle layer of the network can then be utilized for the classification task.

Accordingly, in some embodiments, the classification model of the present invention may be configured for unsupervised training, solely based on document data points consisting of textual content and visual content, absent any training labels. In other embodiments, a variation of the classification model described above may comprise an at least partially labelled training set. In such embodiments, the classification model may incorporate the classification ability of the generated representation directly in the loss function, and create additional fully connected layers on top of the representation inside the correlational model, ending with a softmax activation layer to categorize each representation to its corresponding predicted class.
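The softmax step mentioned above can be illustrated with a single fully connected layer placed on top of a fusion representation. This is a hedged sketch only: the layer sizes, weights, label names, and the `classify` helper are hypothetical, and a real model would learn the weights jointly with the correlational network.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (shifted for stability)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(representation, weights, bias, labels):
    """Apply one fully connected layer plus softmax; return (label, probabilities)."""
    logits = [sum(w * r for w, r in zip(row, representation)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs

# Hypothetical 2-dimensional fusion representation and two classes.
label, probs = classify([1.0, 0.0],
                        weights=[[2.0, 0.0], [0.0, 2.0]],
                        bias=[0.0, 0.0],
                        labels=["invoice", "resume"])
```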

In some embodiments, the present invention comprises one or more types of AE neural networks, wherein each type of AE may be best suited for a different type of representation encoding/decoding task. For example, a CNN may be used for creating representations of visual structure input, and an RNN or LSTM may be used for textual inputs.

FIG. 3A schematically illustrates a ‘branch’ classification model according to an embodiment, which comprises two types of encoders: an LSTM for textual input (x), and a CNN for visual structure input (y). As can be seen, both inputs are encoded into fixed-sized vector representations, which are then fused into a single mutual representation (e.g., a ‘merged’ layer in the middle), from which the model is able to decode both visual and text inputs separately.

FIGS. 3B-3C schematically illustrate a CorrNet classification model according to an embodiment of the present invention. The CorrNet model comprises calls to the ‘branch’ model described above with respect to FIG. 3A, in order to calculate the loss functions following the reconstruction processes described above. The loss functions consider the difference between the original inputs and the reconstructed outputs, offset by the correlations calculated in the process. In other words, the model is penalized for either creating a reconstruction too different from the original input, or obtaining a reconstruction using uncorrelated representations between the two views. During training, for each input type, the CorrNet model reconstructs all the relevant views from either the dual view or the single view contained in the input, and evaluates the loss functions accordingly. In some embodiments, the correlation between the two representations of a sample can be computed at different layers of representations, by calling the ‘branch’ model. Specifically, the term L₄ noted above may be expanded into the sum of the following two calculated types of correlations:

-   correlation of the mutual representation between the two embeddings achieved when training the model on two separate single-view inputs; and
-   correlation of the single-view representation achieved before merging, when training the model on a dual-view input.

In some embodiments, during training, the loss functions described above are assigned weights, to direct the learning process towards the goals in a balanced manner.
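One simple way to picture the weighting of the loss functions is a weighted sum of named loss terms. The sketch below is an assumption-laden illustration: the term names, the example values, and the weights are all hypothetical, and the text does not specify how the weights are chosen.

```python
def weighted_loss(terms, weights):
    """Combine named loss terms into one scalar using per-term weights.

    Raises KeyError if a term has no weight, so a forgotten weight is not
    silently treated as zero.
    """
    return sum(weights[name] * value for name, value in terms.items())

# Hypothetical values for one training batch; the correlation enters with a
# negative sign because it is maximized while the other terms are minimized.
terms = {"self": 1.0, "cross_text": 2.0, "cross_image": 3.0, "neg_corr": -0.5}
weights = {"self": 1.0, "cross_text": 0.5, "cross_image": 0.5, "neg_corr": 2.0}
total = weighted_loss(terms, weights)
```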

Finally, FIG. 3D schematically illustrates yet another variation of the classification model, according to an embodiment, where the middle layer achieved in the ‘branch’ model described above is used as direct input for the classification model.

FIG. 4 is a flowchart of the functional steps in a method for training a multi-modal classifier, according to an embodiment. At 402, a sample electronic document is used as input with respect to a correlational neural model of the present invention. At 404, raw text and a raster image of the sample document are extracted. At 406, textual and visual data representations of the raw text and the image are generated. At 408, a common fusion representation is generated based on the textual data representation, the visual data representation, and the correlation therebetween. Finally, at 410, the fusion representation is used as a direct input in a document classifier.
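The sequence of steps 402 to 410 can be pictured as a small pipeline in which each stage is supplied as a callable. The function below is only a structural sketch: `fusion_pipeline` and every parameter name are hypothetical, and the actual text extraction, rasterization, encoders, fusion model, and classifier are deliberately left to the caller.

```python
from typing import Any, Callable, Sequence

Vector = Sequence[float]

def fusion_pipeline(
    document: Any,
    extract_text: Callable[[Any], str],
    rasterize: Callable[[Any], Any],
    encode_text: Callable[[str], Vector],
    encode_image: Callable[[Any], Vector],
    fuse: Callable[[Vector, Vector], Vector],
    classifier: Callable[[Vector], Any],
) -> Any:
    """Run one document (step 402) through the stages of the flowchart."""
    text = extract_text(document)          # step 404: raw text
    image = rasterize(document)            # step 404: raster image
    text_vec = encode_text(text)           # step 406: textual representation
    image_vec = encode_image(image)        # step 406: visual representation
    fused = fuse(text_vec, image_vec)      # step 408: fusion representation
    return classifier(fused)               # step 410: input to the classifier
```

Passing the stages in as arguments keeps the sketch independent of any particular OCR tool or neural network library, and makes each stage easy to replace with a stub when testing.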

Experimental Results

The present correlational neural model was tested on a data set of labeled sample electronic documents from open and private sources, including over 500 documents from the following different categories (the numbers in parentheses represent the number of samples from each category):

Resumes (227);

invoices (154);

quarterly financial reports (50);

non-disclosure agreements (47); and

scientific papers (47).

For each sample, the raw text was extracted to be used as textual input, and a raster image was generated to be used as visual input. For experimental purposes, the data was randomly split into training and testing sets by a ratio of 2 to 1, respectively. All models were trained solely on the training set (containing two-thirds of the data), and their predictive abilities were tested on the previously unseen test set.
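A random 2-to-1 split of this kind can be sketched as follows. The function name, the fixed seed, and the use of integer sample identifiers are assumptions made for the example; the text does not say how the split was implemented or whether it was stratified by category.

```python
import random

def split_two_to_one(samples, seed=0):
    """Shuffle the samples and return (training set, test set) at a 2:1 ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]

# 227 + 154 + 50 + 47 + 47 = 525 samples across the five listed categories.
train, test = split_two_to_one(range(525))
```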

For the purpose of examining the proposed classification method, the following models were employed:

-   Branch-Model: Dual auto-encoder part of the network, as illustrated with respect to FIG. 3A. This model encodes the multi-modal view of each electronic document into a fixed-size embedding vector.
-   Correlational Network: This model runs multiple sessions of the branch model described above, each time with the relevant input containing one or more of the views, to calculate the relevant loss functions as well as the correlations between the relevant views.
-   Classification model: This model receives as input the embedding vectors from the correlational network, and returns a predicted class of an input electronic document.

For comparison purposes, two reference single-modal models were employed, each with respect to one of the modalities used in the present multi-modal model (i.e., either text or image). Each of the single-modality models was based on the corresponding part in the CorrNet architecture, and comprised the same number of layers with the same values of hyper-parameters and activation layers. This was done to isolate the effect of the architecture and the definition of the loss functions from other factors.

Additionally, the correlational model was tested with respect to its ability to learn a dual-view correlated mapping of both modalities (text and image) given just a single modality as input, as follows:

-   Self-reconstruction: Ability to reconstruct a single view, given the same view as input.
-   Cross-reconstruction: Ability to reconstruct a single view, given the other view as input (i.e., image from text and vice versa).

For the task of document classification, accuracy was tested with respect to the following classification tasks:

-   Single-view classification: Classification accuracy given a single original view as input (either text or image).
-   Single-view, cross-reconstructed classification: Classification accuracy given a single reconstructed view as input (either text reconstructed from original image or vice versa).
-   Multi-modal classification: Classification accuracy given both input modalities (text and image).
-   Multi-modal, cross-reconstructed-views classification: Classification accuracy given multi-modal reconstructions of the original input (each modality reconstructed by the model separately, and the concatenation of both reconstructed views given as input).
-   Multi-modal, semi-reconstructed views classification: Classification accuracy given multi-modal semi-reconstructed input (a concatenation of the original text with the reconstruction of the image from the text, and vice versa).

Table 1 below provides test results for each of the above-referenced classification tasks, where:

-   Precision (also called positive predictive value) refers to the number of correct positive results divided by the number of all positive results returned;
-   Recall (also known as sensitivity) refers to the number of correct positive results divided by the number of all relevant samples; and
-   F1 score refers to overall accuracy, based on the harmonic mean of the precision and the recall scores.
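The three metrics defined above can be computed from true and predicted labels as in the sketch below. The function names are hypothetical, and the macro-averaging over classes is an assumption of the example: the text does not state how per-class scores were aggregated into the single figures reported for each method.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, F1) for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def macro_average(y_true, y_pred):
    """Unweighted mean of the per-class scores (an assumed aggregation)."""
    labels = sorted(set(y_true))
    scores = [per_class_scores(y_true, y_pred, label) for label in labels]
    return tuple(sum(column) / len(labels) for column in zip(*scores))

# Tiny hypothetical example with two classes.
truth = ["invoice", "invoice", "resume", "resume"]
predicted = ["invoice", "resume", "resume", "resume"]
invoice_scores = per_class_scores(truth, predicted, "invoice")
overall = macro_average(truth, predicted)
```

In the toy example, one of the two invoices is misclassified as a resume, so the invoice class has perfect precision but only half recall, which is exactly the kind of imbalance the F1 score is meant to summarize.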

TABLE 1 Test Results

| Method | Precision | Recall | F1 Score |
|---|---|---|---|
| Single view - Text | 0.89 | 0.88 | 0.88 |
| Single view - Image | 0.75 | 0.75 | 0.74 |
| Single view - Reconstructed Text | 0.70 | 0.70 | 0.69 |
| Single view - Reconstructed Image | 0.83 | 0.83 | 0.82 |
| Multi-Modal View - Text + Image | 0.91 | 0.90 | 0.89 |
| Multi-Modal View - Reconstructed Text, Reconstructed Image | 0.86 | 0.85 | 0.85 |
| Multi-Modal View - Reconstructed Text, Image | 0.72 | 0.72 | 0.72 |
| Multi-Modal View - Text, Reconstructed Image | 0.91 | 0.91 | 0.90 |
| Reference Model - Text Only | 0.88 | 0.88 | 0.87 |
| Reference Model - Image | 0.74 | 0.70 | 0.71 |

The values given in Table 1 suggest that the multi-modal model of the present invention is capable of generating a representation useful for a classification that outperforms classifications based on representations generated by similar approaches. Given two inputs, even if one of the inputs is a reconstruction, a fruitful representation can be achieved. Accordingly, the present model allows for effective document classification, by utilizing the multi-view correlation to maximize the abundant information prevailing not just in the textual content but also in the visual structure. Once the CorrNet architecture is trained on existing dual-view inputs, a single view is sufficient to obtain an amplified representation ready for classification.

The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire. Rather, the computer readable storage medium is a non-transient (i.e., non-volatile) medium.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

1. A method comprising: receiving, at a computer, an electronic document on which to train a machine learning classifier; applying, by the computer, a first neural network to raw text extracted from the electronic document to determine a textual data representation of the electronic document; applying, by the computer, a second neural network to a raster image extracted from the electronic document to determine a visual data representation of the electronic document; generating, by the computer, a fusion representation based on the textual data representation and the visual data representation of the electronic document; and applying, by the computer, the machine learning classifier based on the fusion representation to classify one or more new electronic documents.
2. The method of claim 1, wherein the generating is further based on a label associated with the electronic document, the label denoting a document category.
3. The method of claim 1, wherein the applying the first neural network further comprises: generating, by the computer, the textual data representation of said extracted text as a fixed length vector.
4. The method of claim 1, wherein the generating further comprises: generating, by the computer, the fusion representation based on a correlation between the textual data representation and the visual data representation, the textual data representation, and the visual data representation.
5. The method of claim 1, wherein the first neural network is different from the second neural network.
6-7. (canceled)
8. The method of claim 1, wherein the generating is based, at least in part, on a cost function which maximizes a correlation between the textual data representation and the visual data representation.
9. The method of claim 1, wherein the applying the machine learning classifier further comprises: classifying, by the computer, the one or more new electronic documents with the machine learning classifier based on one of textual content and raster image of the one or more new electronic documents.
10. A computing device comprising: a memory containing machine readable medium comprising machine executable code having stored thereon instructions for performing a method of multi-modal electronic document classification; a processor coupled to the memory, the processor configured to execute the machine executable code to cause the processor to: receive, as input, an electronic document on which to train a machine learning classifier for the multi-modal electronic document classification; apply a first neural network to raw text extracted from the electronic document to determine a textual data representation; apply a second neural network to an image extracted from the electronic document to determine a visual data representation; calculate a correlation between the textual data representation and the visual data representation; generate a fusion representation based on the correlation, the textual data representation, and the visual data representation; and apply the machine learning classifier based on the fusion representation to classify a new electronic document.
11. The computing device of claim 10, wherein the generation of the fusion representation is further based on a label denoting a document category.
12. The computing device of claim 10, wherein the processor is further configured to execute the machine executable code to cause the processor, as part of the application of the first neural network to: generate the textual data representation of the raw text as a fixed length vector.
13. The computing device of claim 10, wherein the first neural network is different from the second neural network.
14. The computing device of claim 10, wherein the first neural network is the same as the second neural network.
15-16. (canceled)
17. The computing device of claim 10, wherein the generation of the fusion representation is based, at least in part, on a cost function which maximizes the correlation between the textual data representation and the visual data representation.
18. (canceled)
19. A non-transitory machine readable medium having stored thereon instructions for performing a method comprising machine executable code which when executed by at least one machine, causes the machine to: extract raw text and an image from an electronic document on which to train a machine learning classifier; apply a first neural network to the raw text to determine a textual data representation of the electronic document; apply a second neural network to the image to determine a visual data representation of the electronic document; generate a fusion representation based on the textual data representation, the visual data representation, and a correlation between the textual data representation and the visual data representation; and apply the machine learning classifier based on the fusion representation to classify one or more new electronic documents.
20. (canceled)
21. The non-transitory machine readable medium of claim 19, further comprising machine executable code which when executed by the at least one machine causes the machine to: generate the textual data representation of the raw text as a fixed length vector.
22. The non-transitory machine readable medium of claim 19, wherein the first neural network is different from the second neural network.
23. (canceled)
24. The non-transitory machine readable medium of claim 19, further comprising machine executable code which when executed by the at least one machine causes the machine to: calculate the correlation between textual data representation and visual data representation.
25. (canceled)
26. The non-transitory machine readable medium of claim 19, further comprising machine executable code which when executed by the at least one machine causes the machine to: generate the fusion representation based, at least in part, on a cost function which maximizes the correlation.
27. The non-transitory machine readable medium of claim 19, further comprising machine executable code which when executed by the at least one machine causes the machine to: classify the one or more new electronic documents with the machine learning classifier based on one of textual content and raster image of the one or more new electronic documents.
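
By way of non-limiting illustration only, the correlation, cost function, and fusion steps recited above may be sketched as follows. The sketch assumes that the textual and visual data representations are already available as fixed length vectors of equal dimension, one row per electronic document. The function names, the use of a summed per-dimension Pearson correlation, the use of concatenation as the fusion, and the nearest-centroid classifier are all assumptions made for this example and are not taken from the embodiments described herein.

```python
import numpy as np

def correlation(text_repr, visual_repr, eps=1e-12):
    """Sum of per-dimension Pearson correlations between two batches.

    Both inputs have shape (n_documents, n_dimensions). This particular
    correlation measure is an assumption made for illustration.
    """
    t = text_repr - text_repr.mean(axis=0)
    v = visual_repr - visual_repr.mean(axis=0)
    num = (t * v).sum(axis=0)
    den = np.sqrt((t ** 2).sum(axis=0) * (v ** 2).sum(axis=0)) + eps
    return float((num / den).sum())

def correlation_cost(text_repr, visual_repr):
    """Cost whose minimization maximizes the correlation (hypothetical form)."""
    return -correlation(text_repr, visual_repr)

def fuse(text_repr, visual_repr):
    """Fusion representation; concatenation is an assumed, simple choice."""
    return np.concatenate([text_repr, visual_repr], axis=1)

def fit_centroids(fused, labels):
    """Stand-in classifier: one centroid per document category."""
    classes = np.unique(labels)
    centroids = np.stack([fused[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def classify(fused, classes, centroids):
    """Assign each fused representation to the nearest category centroid."""
    dists = np.linalg.norm(fused[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]
```

In a trained system, the two neural networks would be optimized so that `correlation_cost` decreases, and the nearest-centroid stand-in would be replaced by whichever machine learning classifier a given embodiment employs.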